Lexicostatistic and language similarity clusters are useful for computational linguistic
research that depends on language similarity or cognate recognition. Nevertheless,
no lexicostatistic/language similarity clusters of Indonesian ethnic languages have been
published. We formulate an approach to creating language similarity clusters: we use the
ASJP database to generate the language similarity matrix, generate the hierarchical
clusters with complete linkage and mean linkage clustering, and extract two stable
clusters with high language similarities. We introduce an extended k-means clustering
semi-supervised learning to evaluate the stability level of the hierarchical stable
clusters, i.e., how often they remain grouped together as the number of clusters changes.
The higher the number of trials, the more likely we are to distinctly find the two
hierarchical stable clusters in the generated k-clusters. However, in all five experiments,
the stability level of the two hierarchical stable clusters is highest at 5 clusters.
Therefore, we take the 5 clusters as the best clusters of Indonesian ethnic languages.
Finally, we plot the generated 5 clusters on a geographical map.

Keywords: K-means Clustering, Semi-Supervised Clustering
1. INTRODUCTION


Nowadays, machine-readable bilingual dictionaries are being utilized in actual services [1] to support
intercultural collaboration [2, 3, 4] and other research domains [5, 6, 7, 8, 9], but low-resource languages lack
such resources. Indonesia has a population of 221,398,286 and 707 living languages, which cover 57.8% of the
Austronesian family and 30.7% of the languages in Asia [10]. There are 341 Indonesian ethnic languages facing
various degrees of language endangerment (in trouble / dying), and some of their native speakers do not speak
Bahasa Indonesia well since they live in remote areas. Unfortunately, 13 Indonesian ethnic languages are
already extinct. In order to save low-resource languages like Indonesian ethnic languages from language
endangerment, prior works tried to enrich the basic language resource, i.e., the bilingual dictionary
[11, 12, 13, 14]. Those approaches require lexicostatistic/language similarity clusters of the low-resource
languages to select the target languages. However, to the best of our knowledge, there are no published
lexicostatistic/language similarity clusters of Indonesian ethnic languages. To fill this void, we address the
following research goal: formulating an approach to creating language similarity clusters. We first obtain
40-item word lists from the Automated Similarity Judgment Program (ASJP), then generate the language similarity
matrix, then generate the hierarchical and k-means clusters, and finally plot the generated clusters on a map.
2. AUTOMATED SIMILARITY JUDGMENT PROGRAM

Historical linguistics is the scientific study of language change over time in terms of sound, analogical,
lexical, morphological, syntactic, and semantic information [15]. Comparative linguistics is a branch of
historical linguistics that is concerned with language comparison to determine historical relatedness and to
construct language families [16]. Many methods, techniques, and procedures have been utilized in investigating
the potential distant genetic relationship of languages, including lexical comparison, sound correspondences,
grammatical evidence, borrowing, semantic constraints, chance similarities, sound-meaning isomorphism, etc. [17].
The genetic relationship of languages is used to classify languages into language families. Closely related
languages are those that came from the same origin or proto-language and belong to the same language family.

The Swadesh list is a classic compilation of basic concepts for the purposes of historical-comparative
linguistics. It is used in lexicostatistics (quantitative comparison of lexical cognates) and glottochronology
(chronological relationship between languages). There are various versions of the Swadesh list, with 225 [18],
215 and 200 [19], and lastly 100 [20] words. On finding the best size of the list, Swadesh states that “The only
solution appears to be a drastic weeding out of the list, in the realization that quality is at least as important as
quantity. Even the new list has defects, but they are relatively mild and few in number.” [21]

A widely used notion of string/lexical similarity is the edit distance, also known as the Levenshtein
Distance (LD): the minimum number of insertions, deletions, and substitutions required to transform one string
into the other [22]. For example, the LD between “kitten” and “sitting” is 3 since three transformations are
needed: kitten → sitten (substitution of “s” for “k”), sitten → sittin (substitution of “i” for “e”), and finally
sittin → sitting (insertion of “g” at the end).
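
The definition above can be illustrated with a minimal sketch of the standard dynamic-programming
computation of the LD. The function name levenshtein is our own choice for illustration; it is not
part of ASJP.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] is the distance between the processed prefix of a and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string only.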

Many previous works use the Levenshtein Distance. A study of dialect groupings of Irish Gaelic [23]
gathered data from a questionnaire given to native speakers of Irish Gaelic at 86 sites and obtained
312 different Gaelic words or phrases. Another work, on dialect pronunciation differences of 360
Dutch dialects [24], obtained 125 words from the Reeks Nederlandse Dialectatlassen and normalized the LD by
dividing it by the length of the longer alignment. [25] measure linguistic similarity and intelligibility of 15
Chinese dialects and obtain 764 common syllabic units. [26] define the lexical distance between two words as the
LD normalized by the number of characters of the longer of the two. [27] extend the Petroni definition as LDND
and use it in the Automated Similarity Judgment Program (ASJP).
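
The normalizations mentioned above can be sketched as follows. The function ldn divides the LD by the
length of the longer word, as in [26]. For ldnd, the sketch assumes the formulation commonly described
for ASJP: the mean normalized distance over word pairs with the same meaning, divided by the mean
normalized distance over word pairs with different meanings. This formulation is not spelled out in the
text above, so it should be read as an assumption. The word lists are made-up strings, not ASJP data.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def ldn(a, b):
    """LD divided by the length of the longer word (0 = identical, 1 = nothing in common)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(words_x, words_y):
    """Assumed LDND: mean LDN of same-meaning pairs over mean LDN of different-meaning pairs.

    words_x[i] and words_y[i] are the words of two languages for the same concept i.
    """
    n = len(words_x)
    same = sum(ldn(words_x[i], words_y[i]) for i in range(n)) / n
    different = sum(ldn(words_x[i], words_y[j])
                    for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return same / different


# Hypothetical three-concept word lists, for illustration only.
language_x = ["abc", "def", "ghi"]
language_y = ["abd", "def", "ghij"]
print(ldnd(language_x, language_y))
```

Dividing by the different-meaning average is meant to discount similarity that arises by chance, for
example from two languages sharing a small sound inventory.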

The ASJP, an open-source software, was proposed by [28] with the main goal of developing a database
of Swadesh lists [21] for all of the world’s languages, from which a lexical similarity or lexical distance matrix
between languages can be obtained by comparing the word lists. The classification is based on the 100-item
reference list of Swadesh [20], further reduced to the 40 most stable items [29]. Item stability is the degree to
which words for an item are retained over time and not replaced by another lexical item from the language itself
or a borrowed element. Words resistant to replacement are more stable. Stable items have a greater tendency to
yield cognates (words that have a common etymological origin) within groups of closely related languages.
3. LANGUAGE SIMILARITY CLUSTERING APPROACH

We formalize an approach to creating language similarity clusters: we use the ASJP database to generate
the language similarity matrix, generate the hierarchical clusters, and extract the stable clusters
with high language similarities. The hierarchical stable clusters are evaluated using our extended k-means
clustering. Finally, the obtained k-means clusters are plotted on a geographical map. The flowchart of the whole
process is shown in Figure 1.

In this paper, we focus on Indonesian ethnic languages. We obtain word lists of 119 Indonesian ethnic
languages with at least 100,000 speakers. However, it is difficult to classify 119 languages and
obtain valuable information from the generated clusters; therefore, we further filter the target languages
based on the number of speakers and the availability of language information in Wikipedia. We obtain the 32
target languages shown in Table 1 from the intersection between the 46 Indonesian ethnic languages with more
than 300,000 speakers listed by Wikipedia and the 119 Indonesian ethnic languages with more than 100,000
speakers provided by ASJP.

We further generate the similarity matrix of those 32 languages as shown in Figure 2. We added a
white-red color scale where white means the two languages are totally different (0% similarity) and the
reddest color means the two languages are exactly the same (100% similarity). For clarity and to avoid
redundancy, we only show the bottom-left part of the table. The headers follow the language codes in Table 1.
Figure 1. Flowchart of Generating Language Similarity Clusters


Table 1. Target Languages

Code  Population  Language                | Code  Population  Language
L1    232004800   INDONESIAN              | L17   1000000     GORONTALO
L2    84300000    OLD OR MIDDLE JAVANESE  | L18   1000000     JAMBI MALAY
L3    34000000    SUNDANESE               | L19   900000      MANGGARAI
L4    15848500    MALAY                   | L20   770000      NIAS NORTHERN
L5    15848500    PALEMBANG MALAY         | L21   750000      BATAK ANGKOLA
L6    6770900     MADURESE                | L22   700000      UAB METO
L7    5530000     MINANGKABAU             | L23   600000      KARO BATAK
L8    5000000     BUGINESE                | L24   500000      BIMA
L9    5000000     BETAWI                  | L25   470000      KOMERING
L10   3502300     BANJARESE MALAY         | L26   350000      REJANG
L11   3500032     ACEH                    | L27   331000      TOLAKI
L12   3330000     BALI                    | L28   300000      GAYO
L13   2130000     MAKASAR                 | L29   300000      MUNA
L14   2100000     SASAK                   | L30   250000      TAE
L15   2000000     TOBA BATAK              | L31   245020      AMBONESE MALAY
L16   1100000     BATAK MANDAILING        | L32   230000      MONGONDOW


Figure 2. Lexicostatistic / Similarity Matrix of 32 Indonesian Ethnic Languages by ASJP (%)

Hierarchical clustering is an approach that builds a hierarchy from the bottom up and does not
require us to specify the number of clusters beforehand. The algorithm works as follows: (1) put each data point
in its own cluster; (2) identify the closest two clusters and combine them into one cluster; (3) repeat the above
step until all the data points are in a single cluster. The result is usually represented by a dendrogram.
There are a few ways to determine how close two clusters are: (1) complete linkage clustering: find
the maximum possible distance between points belonging to two different clusters; (2) single linkage clustering:
find the minimum possible distance between points belonging to two different clusters; (3) mean/average
linkage clustering: find all possible pairwise distances for points belonging to two different clusters and then
calculate the average; (4) centroid linkage clustering: find the centroid of each cluster and calculate the distance
between the centroids of two clusters. Complete linkage and mean (average) linkage clustering are the ones used
most often. We generate the distance matrix from the similarity matrix shown in Figure 2 and then generate
the hierarchical clusters with the hclust function in R, a free software environment for statistical
computing and graphics, using the complete linkage clustering method, as shown in Figure 3(a), and the mean
linkage clustering method, as shown in Figure 3(b).
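
The bottom-up procedure and the linkage criteria described above can be sketched in a few lines. This is
an illustrative re-implementation in Python, not the R hclust call used for Figure 3. The conversion
distance = 100 - similarity is an assumption, since the exact conversion is not stated here, and the
4-by-4 similarity matrix is made up.

```python
from itertools import combinations


def to_distance(similarity):
    """Assumed conversion from a similarity in percent to a distance."""
    return [[100.0 - value for value in row] for row in similarity]


def linkage_distance(cluster_a, cluster_b, dist, method):
    """Distance between two clusters under the chosen linkage criterion."""
    pairwise = [dist[i][j] for i in cluster_a for j in cluster_b]
    if method == "complete":
        return max(pairwise)
    if method == "single":
        return min(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)


def agglomerate(dist, method="complete"):
    """Merge the two closest clusters until one remains; return (a, b, height) per merge."""
    clusters = [frozenset([i]) for i in range(len(dist))]  # (1) every point is its own cluster
    merges = []
    while len(clusters) > 1:                               # (3) repeat until a single cluster
        a, b = min(combinations(clusters, 2),              # (2) find the closest two clusters
                   key=lambda pair: linkage_distance(pair[0], pair[1], dist, method))
        height = linkage_distance(a, b, dist, method)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b, height))
    return merges


# Hypothetical similarity matrix (%) for four languages, for illustration only.
similarity = [[100, 90, 20, 10],
              [90, 100, 30, 20],
              [20, 30, 100, 80],
              [10, 20, 80, 100]]
for a, b, height in agglomerate(to_distance(similarity), "complete"):
    print(sorted(a), sorted(b), height)
```

The list of merges with their heights carries the same information as a dendrogram. The search over all
cluster pairs is slow for large inputs but adequate for a few dozen languages.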
Figure 3. Hierarchical Clusters Dendrogram of 32 Indonesian Ethnic Languages.


From the two hierarchical clusterings in Figure 3, we select two stable clusters that are always grouped
together regardless of the linkage clustering method. The first cluster consists of TOBA BATAK, BATAK
MANDAILING, and BATAK ANGKOLA, while the second cluster consists of MINANGKABAU, BETAWI,
AMBONESE MALAY, BANJARESE MALAY, PALEMBANG MALAY, JAMBI MALAY, MALAY, and
INDONESIAN. Since the languages within each of the two stable clusters have similarities above 50%, they are
good clusters to refer to when selecting target languages for computational linguistic research that
depends on language similarity or cognate recognition for inducing bilingual lexicons from the target languages
[11, 12, 14, 30]. The two clusters are actually sufficient for selecting the target languages for such research.
However, we still need to evaluate the stability of those clusters, and we also need to identify the clusters with
low language similarities in order to grasp the whole picture of Indonesian ethnic languages. Thus, we utilize an
alternative clustering approach, k-means clustering.
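
One possible reading of this selection step is sketched below: keep the clusters that appear in both
hierarchies, require every pairwise similarity inside a cluster to exceed a threshold, and report only
the largest such clusters. The function name, the strict 50% threshold check, and the toy inputs are
assumptions made for illustration; the selection in this paper was done by inspecting Figure 3.

```python
def stable_clusters(hierarchy_a, hierarchy_b, similarity, threshold=50.0):
    """Clusters found in both hierarchies whose members are all mutually similar.

    hierarchy_a and hierarchy_b are collections of frozensets of language indices,
    one frozenset per node of the corresponding dendrogram.
    """
    shared = set(hierarchy_a) & set(hierarchy_b)
    candidates = [
        cluster for cluster in shared
        if len(cluster) > 1
        and all(similarity[i][j] > threshold for i in cluster for j in cluster if i != j)
    ]
    # Drop clusters that are strictly contained in another qualifying cluster.
    maximal = [c for c in candidates if not any(c < other for other in candidates)]
    return sorted(maximal, key=sorted)


# Hypothetical dendrogram nodes from two linkage methods over five languages.
complete_nodes = [frozenset({0, 1}), frozenset({3, 4}),
                  frozenset({0, 1, 2}), frozenset({0, 1, 2, 3, 4})]
average_nodes = [frozenset({0, 1}), frozenset({3, 4}),
                 frozenset({2, 3, 4}), frozenset({0, 1, 2, 3, 4})]

# Hypothetical similarities (%): only languages 0 and 1 are more than 50% similar.
similarity = [[100 if i == j else 10 for j in range(5)] for i in range(5)]
similarity[0][1] = similarity[1][0] = 80
similarity[3][4] = similarity[4][3] = 40

print(stable_clusters(complete_nodes, average_nodes, similarity))
```

In this toy input, the pair of languages 3 and 4 is grouped by both linkage methods but is excluded
because its similarity is below the threshold.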


K-means clustering is an unsupervised learning algorithm that tries to cluster data based on their
similarity. Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to
find patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to
be grouped into. The algorithm works as follows: (1) the algorithm randomly assigns each observation to a
cluster and finds the centroid of each cluster; (2) the algorithm then iterates through two steps: (2a) reassign
data points to the cluster whose centroid is closest; (2b) calculate the new centroid of each cluster. These two
steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation
is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster
centroids.
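
The steps above can be sketched as a small, dependency-free implementation. It follows the random
initial assignment described in the text; the handling of empty clusters, the iteration cap, and the toy
points are our own choices for illustration and do not reproduce the R kmeans function.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, rng=None, max_iter=100):
    """Return one cluster label per point (requires k <= number of points)."""
    rng = rng or random.Random()
    # (1) Random initial assignment; shuffling balanced labels keeps every cluster non-empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        # (2b) Centroid of each cluster.
        centroids = []
        for cluster in range(k):
            members = [p for p, label in zip(points, labels) if label == cluster]
            if not members:
                members = [rng.choice(points)]  # re-seed a cluster that became empty
            centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        # (2a) Reassign every point to its closest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # no change, so the variation cannot decrease further
            break
        labels = new_labels
    return labels


def within_cluster_variation(points, labels):
    """Sum of squared distances between the points and their cluster centroids."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, label in zip(points, labels) if label == cluster]
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        total += sum(sq_dist(p, centroid) for p in members)
    return total


# Hypothetical 2D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(points, 2, random.Random(0))
print(labels, within_cluster_variation(points, labels))
```

Because the result depends on the random start, different runs can return different partitions on less
clearly separated data, which is the property the stability evaluation relies on.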


It is well known that standard agglomerative hierarchical clustering techniques are not tolerant to noise
[31, 32]. There are many previous works on finding clusters that are robust to noise [33, 34, 35]. However, to
evaluate the stability of the hierarchical stable clusters, we introduce a simple approach that calculates their
stability level, i.e., how often they remain grouped together as the number of k-means clusters changes. We
extend the unsupervised k-means clustering to a semi-supervised k-means clustering, as shown in Algorithm
1, by labeling the two hierarchical stable clusters beforehand.
Algorithm 1: Cluster Stability Evaluator


Input: similarityMatrix, stableClusters, minimumK, maximumTrial;
Output: stabilityLevel
trial ← 1;
currentK ← minimumK;
maximumK ← length(similarityMatrix);
scale2D ← cmdscale(similarityMatrix); // multidimensional to 2D scaling
while currentK <= maximumK do
    successfulTrial ← 0; // initialized for each currentK
    while trial <= maximumTrial do
        kClusters ← kmeans(scale2D, currentK);
        if stableClusters distinctly found in kClusters then
            successfulTrial++;
        end
        trial++; // try again with the same number of clusters (currentK)
    end
    stabilityLevel[currentK] ← successfulTrial / maximumTrial;
    currentK++; // increase the number of clusters
    trial ← 1; // reset the number of trials
end
return stabilityLevel;
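
A runnable sketch of the evaluation loop in Algorithm 1 is given below, under several assumptions. The
input points are taken to be 2D coordinates already obtained by multidimensional scaling, so the
cmdscale step is omitted. The k-means routine is a minimal stand-in for the R kmeans function. The
condition "distinctly found" is interpreted as: each stable cluster coincides exactly with one of the
generated k-means clusters. The function names and the toy points are hypothetical.

```python
import random


def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, rng, max_iter=100):
    """Minimal k-means with a random initial assignment; returns one label per point."""
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        centroids = []
        for cluster in range(k):
            members = [p for p, label in zip(points, labels) if label == cluster]
            if not members:
                members = [rng.choice(points)]  # re-seed a cluster that became empty
            centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def distinctly_found(stable_clusters, labels):
    """Assumed criterion: every stable cluster equals one k-means cluster exactly."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    generated = {frozenset(group) for group in groups.values()}
    return all(frozenset(cluster) in generated for cluster in stable_clusters)


def stability_levels(points, stable_clusters, minimum_k, maximum_k, maximum_trial, seed=0):
    """Success rate of recovering the stable clusters for each number of clusters."""
    rng = random.Random(seed)
    levels = {}
    for current_k in range(minimum_k, maximum_k + 1):
        successful = sum(
            distinctly_found(stable_clusters, kmeans(points, current_k, rng))
            for _ in range(maximum_trial)
        )
        levels[current_k] = successful / maximum_trial
    return levels


# Hypothetical 2D coordinates: two tight groups (indices 0-2 and 3-5) plus three outliers.
points = [(0, 0), (0, 1), (1, 0), (20, 20), (20, 21), (21, 20), (50, 0), (0, 50), (50, 50)]
print(stability_levels(points, [{0, 1, 2}, {3, 4, 5}], minimum_k=2, maximum_k=6, maximum_trial=50))
```

The nested for loops replace the explicit trial and currentK counters of the pseudocode, which removes
the need to reset the trial counter by hand.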
4. RESULT AND DISCUSSION

Initially, we manually conduct several trials to estimate the minimum and maximum number of k-means
clusters for which the stable clusters are distinctly found. Based on the initial trials, we estimate
minimumK = 4 and maximumK = 21. Then, we calculate the stability level of the two hierarchical stable
clusters with the number of clusters ranging from minimumK = 4 to maximumK = 21, following Algorithm
1. We run five sets of experiments with maximumTrial equal to 50, 500, 5,000, 50,000, and 500,000. In
each experiment, the stability level of the two hierarchical stable clusters is measured for each number of k-means
clusters by calculating the success rate of obtaining the two hierarchical stable clusters in the generated
k-clusters, as shown in Figure 4.

The higher the number of trials, the more likely we are to distinctly find the two hierarchical stable
clusters in the generated k-clusters when the number of clusters is large. For example, within 50 trials, we
cannot find the two hierarchical stable clusters distinctly in the generated k-clusters for large numbers of
clusters (k > 14). However, within 50,000 and 500,000 trials, we can find the two hierarchical stable clusters
distinctly in the generated k-clusters for every number of clusters between minimumK = 4 and maximumK = 21,
even though the success rate decreases as the number of clusters increases. In all five experiments, the stability
level of the two hierarchical stable clusters is highest (0.78) at 5 clusters.

Therefore, we take the 5 clusters shown in Figure 5 as the best clusters of Indonesian ethnic languages
to refer to when selecting target languages for computational linguistic research that depends on language
similarity or cognate recognition. We further plot the 5 clusters on a geographical map, as shown in Figure 6.


Figure 4. Obtaining Stable Clusters in n Trials: (a) 50 Trials; (b) 500 Trials
Figure 5. K-means Clusters of 32 Indonesian Ethnic Languages: 5 Clusters


Figure 6. Similarity Clusters Map of 32 Indonesian Ethnic Languages: 5 Clusters


5. CONCLUSION 


We utilized the ASJP database to generate the language similarity matrix, then generated the hierarchical
clusters with complete linkage and mean linkage clustering, and further extracted two stable clusters with the
highest language similarities. We applied our extended k-means clustering semi-supervised learning to evaluate
the stability level of the hierarchical stable clusters, i.e., how often they remain grouped together as the
number of clusters changes. The higher the number of trials, the more likely we are to distinctly find the two
hierarchical stable clusters in the generated k-clusters. However, in all five experiments, the stability level of
the two hierarchical stable clusters is highest (0.78) at 5 clusters. Therefore, we take the 5 clusters as the best
clusters of Indonesian ethnic languages to refer to when selecting target languages for computational linguistic
research that depends on language similarity or cognate recognition. Finally, we plot the generated 5 clusters on
a geographical map. Our algorithm can be used to find and evaluate other stable clusters of Indonesian ethnic
languages or other language sets.